Abstract. We consider how to index strings, trees and graphs for jumbled
pattern matching when we are asked to return a match if one exists. For
example, we show how, given a tree containing two colours, we can build a
quadratic-space index with which we can find a match in time proportional to
the size of the match. We also show how we need only linear space if we are
content with approximate matches.

1 Introduction 


Suppose we are given a connected graph $G$ on $n$ coloured nodes and a multiset
$M$ of colours and asked to find a connected subgraph of $G$ whose nodes'
colours are exactly those in $M$, if such a subgraph exists. Even when $G$ is a
tree there can be exponentially many such matching subgraphs. When $G$ is a
path, however, there are $O(n)$ matches and we can find them all in $O(n)$ time
[5]. When $G$ is a path containing a constant number of colours, in $O(n^2)$
time we can build an $o(n^2)$-space index with which we can determine in $o(n)$
time whether there is a match [7]. When $G$ is a path containing only two
colours, in $O(n^2/\log^2 n)$ time we can build an $O(n)$-bit index with which
we can determine in $O(1)$ time whether there is a match [3111916]. It follows
that in $O(n^2/\log n)$ time we can build an index of $O(n \log n)$ bits with
which we can find all the matches using $O(|M|)$ worst-case time per match [6].
We can build an approximation of this index in $O(n^{1+\epsilon})$ time, with
the quality of the approximation depending on $\epsilon$ [1]. Throughout this
paper our model is the word-RAM with $\Omega(\log n)$-bit words and we measure
space in words unless stated otherwise.

Determining whether there is a match is NP-complete even when $G$ is a tree [8]
or when it contains only two colours, but takes polynomial time when $G$ both
has bounded treewidth and contains only a constant number of colours [S]. When
$G$ contains only two colours there exists an $O(n)$-bit index with which we
can determine in $O(1)$ time whether there is a match [^. Building this index is
NP-hard in general but, since finding a match is self-reducible, takes
polynomial time when $G$ has bounded treewidth and $O(n^2/\log^2 n)$ time when
$G$ is a tree. At the cost of increasing the space to $O(n)$ words, this index
can be generalized to return a subset of the nodes in the matches that is also
a hitting set for all the matches, using $O(\log n)$ worst-case time per match.
In the worst case, however, this subset of nodes is of little use in finding
even a single complete match.

We start by presenting some basic tradeoffs in Section 2. In Sections 3 to 5 we
assume $G$ contains only two colours. In Section 3 we consider the case when
$G$ is a path (i.e., a binary string) and describe an $O(n)$-space index with
which we can find a match in $O(\log n)$ time. In Section 4 we consider the
case when $G$ is a tree and, based on our index for binary strings, describe an
$O(n^2)$-space index with which we can find a match in $O(|M|)$ time. If we are
concerned only with multisets of size at most $n^{1/2}$, then we can reduce the
space bound to $O(n)$. In Section 5 we show that we can achieve the same space
bound if we are content with approximate matches. In the full version of this
paper we will partially extend our results to graphs, by working on spanning
trees.

2 Basic Tradeoffs 

Suppose $G$ is a graph containing a constant number $c$ of colours and we will
be given $M$ as the vector of length $c$ whose components are the frequencies
of the characters, which is called the Parikh vector for $M$. Since there are
$\binom{m+c-1}{c-1} = O(m^{c-1})$ possible multisets of size $m$ and it takes
$O(m)$ space to store pointers to a match for such a multiset, there exists an
$O(n^{c+1})$-space index with which we can find a match in $O(|M|)$ time. When
$G$ has bounded treewidth we can build this index in polynomial time, and we
can reduce the space bound to $O(n)$ at the cost of increasing the query time
to $|M|^{O(1)}$. To do the latter, we store $G$ itself and pre-compute and
store pointers to matches only for multisets of size at most $n^{1/(c+1)}$.
Given a multiset $M$ with $|M| > n^{1/(c+1)}$, we search $G$ in
$n^{O(1)} = |M|^{O(1)}$ time.
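The two space bounds above follow from a short calculation. The following is a sketch, assuming (as in the text) that a match for a multiset of size $m$ is stored as $O(m)$ pointers; the threshold name $t$ is introduced here only for illustration:

```latex
\sum_{m=1}^{n}
  \underbrace{O(m^{c-1})}_{\text{multisets of size } m} \cdot
  \underbrace{O(m)}_{\text{pointers per match}}
  = O\Big(\sum_{m=1}^{n} m^{c}\Big) = O(n^{c+1}),
\qquad
\sum_{m=1}^{t} O(m^{c}) = O(t^{c+1}) = O(n)
  \quad \text{for } t = n^{1/(c+1)} .
```

For $|M| > t$ we have $n < |M|^{c+1}$, so any search of $G$ that takes $n^{O(1)}$ time also takes $|M|^{O(1)}$ time.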

For any positive constant $\epsilon$, we can build an $O(n \log^c n)$-space
approximate index with which, if $M$ has an exact match, then in $O(1)$ time we
can find a subgraph whose Parikh vector differs from $M$'s by at most a factor
of $1+\epsilon$ in each component. (This index does not tell us whether $M$ has
an exact match, however, since we may find such a subgraph even when it does
not.) Without loss of generality, assume we are concerned only with multisets
in which each character appears at least once; we can reduce the general case
to $2^c = O(1)$ instances of this one. We store a $c$-dimensional grid with
each side having length $\lfloor \log_{1+\epsilon} n \rfloor + 1$. For each
point $(x_0, \ldots, x_{c-1})$ in this grid, we store pointers to the nodes in
a connected subgraph whose Parikh vector is between
$((1+\epsilon)^{x_0}, \ldots, (1+\epsilon)^{x_{c-1}})$ and
$((1+\epsilon)^{x_0+1}, \ldots, (1+\epsilon)^{x_{c-1}+1})$. This takes a total
of $O(n \log^c n)$ space. Given the Parikh vector $(v_0, \ldots, v_{c-1})$ of
$M$, we return the subgraph stored for the point
$(\lfloor \log_{1+\epsilon} v_0 \rfloor, \ldots, \lfloor \log_{1+\epsilon} v_{c-1} \rfloor)$
in the grid, if that subgraph exists. We summarize these basic tradeoffs in the
following lemma:



Lemma 1. When $G$ is a graph containing a constant number $c$ of colours there
exists an $O(n^{c+1})$-space index with which we can find a match in $O(|M|)$
time. For any positive constant $\epsilon$ there exists an
$O(n \log^c n)$-space index with which in $O(|M|)$ time we can find an
approximate match in which each colour's frequency is within a factor of
$1+\epsilon$ of its frequency in $M$. When $G$ has bounded treewidth we can
build these indexes in polynomial time and, moreover, we can reduce the space
of the exact index to $O(n)$ at the cost of increasing the query time to
$|M|^{O(1)}$.
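The grid lookup behind the approximate index of Lemma 1 can be sketched as follows. This is only an illustration: the function names are hypothetical, and exact rational arithmetic is used (an assumption of this sketch, not something the paper specifies) so that the floor of the logarithm is not disturbed by floating-point rounding.

```python
from fractions import Fraction

def grid_coordinate(v, eps):
    """Return floor(log_{1+eps} v) for an integer v >= 1, i.e. the largest
    x >= 0 with (1 + eps)**x <= v, computed without floating-point logs."""
    base = 1 + Fraction(eps)
    x, power = 0, base
    while power <= v:
        x += 1
        power *= base
    return x

def grid_point(parikh, eps):
    """Map a Parikh vector (all components >= 1) to its grid point."""
    return tuple(grid_coordinate(v, eps) for v in parikh)
```

An index would then be a table keyed by grid points, with each entry holding pointers to the nodes of one stored subgraph (or nothing, if no subgraph falls in that cell).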

When $G$ is a path, which we can think of as a string over an alphabet of $c$
characters, we can improve these bounds. Since $G$ contains $O(n^2)$ substrings
and we can specify any substring by its two endpoints, we can build an
$O(n^2)$-space index with which we can find a match in $O(1)$ time. Calculation
shows we can reduce the space bound to $O(n)$ at the cost of increasing the
query time to $O(|M|^c)$, and we can store an approximate index in
$O(\log^c n)$ space. In Appendix A we show how in $O(n^{1+\epsilon})$ expected
time we can build an index with which we can find all $occ$ matches of $M$ in
$O(|M|^{1/\epsilon} + occ)$ time.

As an aside, we note that we can extend our approximate indexes to support
approximate scaled-then-permuted pattern matching (see [5]). To do this, for
each point $(x_0, \ldots, x_{c-1})$ in the grid for which there is no subgraph
whose Parikh vector is between
$((1+\epsilon)^{x_0}, \ldots, (1+\epsilon)^{x_{c-1}})$ and
$((1+\epsilon)^{x_0+1}, \ldots, (1+\epsilon)^{x_{c-1}+1})$, we store pointers
to the nodes in a connected subgraph (if there is one) whose Parikh vector is a
multiple of one between
$((1+\epsilon)^{x_0}, \ldots, (1+\epsilon)^{x_{c-1}})$ and
$((1+\epsilon)^{x_0+1}, \ldots, (1+\epsilon)^{x_{c-1}+1})$. The query time is
still proportional to the size of the match returned but that may now be larger
than $|M|$.

3 An Index for Binary Strings 

Suppose $G$ is a binary string, i.e., $G[1..n] \in \{0,1\}^*$. If there are $p$
copies of 1 in $G[i..i+m-1]$ and $r$ copies of 1 in $G[k..k+m-1]$, then for
every value $q$ between $p$ and $r$ there is a position $j$ between $i$ and $k$
such that $G[j..j+m-1]$ contains $q$ copies of 1. This observation was the
basis for the index in [3] and is the basis for ours as well.

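We store an $O(1)$-time rank data structure for $G$ and, for $1 \le m \le n$,
we store the endpoints of two substrings of length $m$ in $G$ with the most and
with the fewest copies of 1. This takes a total of $O(n)$ space. Given a Parikh
vector $(v_0, v_1)$, we look up the left endpoints $i$ and $j$ of the
substrings of length $v_0 + v_1$ in $G$ with the fewest and with the most
copies of 1, respectively. If $v_1$ is smaller than the number of 1s in the
former or larger than the number in the latter, then there is no match. If
either substring contains exactly $v_1$ copies of 1, then we report it.
Otherwise, we set $i$ and $j$ as the initial endpoints for a binary search: at
each step, we use two rank queries to find the number $q$ of 1s in
$G[\lfloor (i+j)/2 \rfloor .. \lfloor (i+j)/2 \rfloor + v_0 + v_1 - 1]$; if
$q = v_1$ then we stop and report this substring by its endpoints; if
$q < v_1$ then we set $i = \lfloor (i+j)/2 \rfloor$ and continue; if $q > v_1$
then we set $j = \lfloor (i+j)/2 \rfloor$ and continue. Throughout, the
substring starting at $i$ contains fewer than $v_1$ copies of 1 and the one
starting at $j$ contains more, so by the observation above a match always lies
between them. This search takes a total of $O(\log n)$ time.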

Theorem 1. When $G$ is a path containing only two colours, we can build an
$O(n)$-space index with which we can find a match in $O(\log n)$ time.
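The index behind Theorem 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the construction from the cited work: a prefix-sum array stands in for the $O(1)$-time rank structure, the extreme windows are found by brute force in quadratic time, positions are 0-based, and the class and method names are hypothetical.

```python
from itertools import accumulate

class BinaryJumbledIndex:
    def __init__(self, s):
        self.n = len(s)
        # prefix[k] = number of 1s in s[:k]; plays the role of rank.
        self.prefix = [0] + list(accumulate(int(ch) for ch in s))
        # lo[m] / hi[m]: start of a length-m window with fewest / most 1s.
        self.lo = [None] * (self.n + 1)
        self.hi = [None] * (self.n + 1)
        for m in range(1, self.n + 1):
            starts = range(self.n - m + 1)
            self.lo[m] = min(starts, key=lambda a: self.ones(a, m))
            self.hi[m] = max(starts, key=lambda a: self.ones(a, m))

    def ones(self, start, m):
        """Number of 1s in s[start:start+m], via two rank (prefix) lookups."""
        return self.prefix[start + m] - self.prefix[start]

    def find(self, v0, v1):
        """Return (start, end) of a window with v0 zeros and v1 ones, or None."""
        m = v0 + v1
        if m < 1 or m > self.n:
            return None
        i, j = self.lo[m], self.hi[m]
        if not self.ones(i, m) <= v1 <= self.ones(j, m):
            return None
        # Invariant: ones(i, m) <= v1 <= ones(j, m).
        while True:
            if self.ones(i, m) == v1:
                return (i, i + m)
            if self.ones(j, m) == v1:
                return (j, j + m)
            mid = (i + j) // 2
            q = self.ones(mid, m)
            if q == v1:
                return (mid, mid + m)
            if q < v1:
                i = mid
            else:
                j = mid
```

The search terminates because adjacent windows differ by at most one in their number of 1s: whenever neither endpoint is a match, the two start positions are at least two apart and the midpoint lies strictly between them.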



4 Exact Indexes for Trees with Two Colours 

Suppose $G$ is a tree containing only two colours, black and white. Gagie,
Hermelin, Landau and Weimann [B] noted that the observation in Section 3 can be
extended to connected graphs: if there are connected subgraphs $H_p$ and $H_r$
in $G$ with $m$ nodes each and $p$ and $r$ white nodes, respectively, then for
every value $q$ between $p$ and $r$, there is a connected subgraph $H_q$ with
$m$ nodes and $q$ white nodes.

To see why, notice that we can construct a sequence of connected subgraphs with
$m$ nodes such that the sequence starts with $H_p$ and ends with $H_r$ and any
consecutive pair of subgraphs in the sequence differ on two nodes. To build
this sequence, we find a path between $H_p$ and $H_r$. We root $H_p$ and $H_r$,
which are trees themselves, at the first and last nodes in the path (or at a
shared node, if they are not disjoint). One by one, we remove nodes bottom-up
in $H_p$ and add nodes along the path; remove nodes nearest to $H_p$ in the
path and add nodes further along the path; then remove nodes from the path and
add nodes top-down in $H_r$.
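The interval property stated above can be checked by brute force on small trees. The sketch below is an exponential-time sanity check, not part of the index; the function names and the adjacency-list input format are assumptions made for this illustration.

```python
def connected_subsets(adj, m):
    """All connected node sets of size m (m >= 1) in a graph given as
    adjacency lists. Every connected set of size k+1 arises from a connected
    set of size k by adding one neighbouring node, so we grow level by level."""
    level = {frozenset([u]) for u in range(len(adj))}
    for _ in range(m - 1):
        level = {s | {v} for s in level for u in s for v in adj[u] if v not in s}
    return level

def white_counts(adj, is_white, m):
    """Sorted distinct numbers of white nodes over all connected size-m sets."""
    return sorted({sum(1 for u in s if is_white[u])
                   for s in connected_subsets(adj, m)})
```

For any tree and any size $m$, the list returned by `white_counts` should contain every integer between its first and last entries.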

Suppose $p$ and $r$ are the minimum and maximum numbers of white nodes in any
connected subgraphs of size $m$, and we store a path consisting of the nodes in
$H_p$ in bottom-up order, followed by the nodes in the path, followed by the
nodes in $H_r$ in top-down order. If we apply Theorem 1 to this path, then we
obtain an $O(n)$-space index with which, given the Parikh vector for a multiset
$M$ with $|M| = m$, we can find a match in the graph $G$ in
$O(\log n + |M|)$ time. Notice that, if $|M| < \log n$, then we can simply
store an $O(\log^2 n)$-space lookup table with which we can find a match in
$O(|M|)$ time. Therefore, applying this construction for $1 \le m \le n$, we
obtain the following theorem:

Theorem 2. When $G$ is a tree containing only two colours, we can build an
$O(n^2)$-space index with which we can find a match in $O(|M|)$ time.

When $m \approx n$, we need $O(n)$ space to store subgraphs with the minimum
and maximum numbers of white nodes and the path between them. When $m \ll n$,
however, those subgraphs are small and most of the space is taken up by the
path. We now claim we can store $G$ such that we can support fast rank queries
on paths; due to space constraints, we leave the proof to Appendix B.

Lemma 2. We can store $G$ in $O(n)$ space such that $q$ rank queries on the
path between any two nodes take a total of $O(\log n + q)$ time.

If we store $G$ with Lemma 2 and store subgraphs with the minimum and maximum
numbers of white nodes only for $1 \le m \le n^{1/2}$, then our index takes
only $O(n)$ space but supports queries only for $|M| \le n^{1/2}$. When
$|M| > n^{1/2}$ we can use an algorithm by Gagie et al. to find a match in
$O(|M| n) = O(|M|^3)$ time.

Corollary 1. When $G$ is a tree containing only two colours, we can build an
$O(n)$-space index with which we can find a match in $O(|M|)$ time when
$|M| \le n^{1/2}$ and in $O(|M|^3)$ time otherwise.



5 An Approximate Index for Trees with Two Colours 

In this section we present our most technical result, which is how to store in
$O(n)$ space an approximate index for a tree containing only two colours.
Again, an approximate match is one whose Parikh vector differs from $M$'s by a
factor of at most $1+\epsilon$ in each component. In contrast, with Lemma 1 we
would use $O(n \log^2 n)$ space. Without loss of generality, assume we are only
concerned with multisets in which there are at least as many black nodes as
white nodes; we can build a symmetric index for the other case. Notice that in
this case, if we can find a connected subgraph $H$ with the same size as the
given multiset $M$ and in which the number of white nodes is within a factor of
$1+\epsilon$ of the number in $M$, then the number of black nodes in $H$ is
also within a factor of $1+\epsilon$ of the number in $M$.

Our main idea is to store an $O(n)$-space data structure with which, given a
size $m$, we can find two connected subgraphs with size $m$ that have
approximately the minimum and maximum numbers of white nodes. Suppose we store
a subgraph with the minimum number of white nodes for each size that is a power
of two and for each size such that the minimum number of white nodes is a
factor of $1+\epsilon$ greater than the number in the preceding stored
subgraph. That is, we store a sequence of $\lg n$ subgraphs with total size
$O(n)$ and a sequence of $\log_{1+\epsilon} n$ subgraphs with total size
$O(n \log n)$. The latter sequence of subgraphs has total size $O(n \log n)$ in
the worst case because the minimum number of white nodes may stay low until we
reach size nearly $n$ and then increase rapidly, causing us to store about
$\log_{1+\epsilon} n$ subgraphs each of size nearly $n$. However, we can store
this sequence of subgraphs in a total of $O(n)$ space using the following
lemma, which we prove in Appendix B. Similarly, we also store a subgraph with
the maximum number of white nodes for each size that is a power of two and for
each size such that the maximum number of white nodes is a factor of
$1+\epsilon$ greater than the number in the preceding stored subgraph; this
also takes $O(n)$ total space if we store the subgraphs with the following
lemma.
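The sampling rule for the minimum-white subgraphs can be sketched as follows. This illustration assumes the minimum number of white nodes for every size has already been computed and is given as a list; it also reads "a factor of $1+\epsilon$ greater" as a strict inequality, which is an interpretation rather than something the text fixes. The function name is hypothetical.

```python
def sampled_sizes(min_white, eps):
    """min_white[m] is the minimum number of white nodes over connected
    subgraphs of size m (index 0 is unused). Return the sizes at which a
    subgraph would be stored: every power of two, plus every size whose
    minimum exceeds (1 + eps) times the value at the last stored size."""
    sizes, last = [], None
    for m in range(1, len(min_white)):
        is_power_of_two = (m & (m - 1)) == 0
        grew = last is not None and min_white[m] > (1 + eps) * last
        if is_power_of_two or grew:
            sizes.append(m)
            last = min_white[m]
    return sizes
```

The rule for the maximum-white subgraphs is the same with the maxima in place of the minima.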

Lemma 3. We can store $G$ in $O(n)$ space such that, if $G$ contains a
connected subgraph of size $m$ with $w$ white nodes, then we can represent some
such subgraph in $O(w)$ space such that recovering this subgraph takes $O(m)$
time.

If we are given a multiset $M$ such that we have subgraphs of size $|M|$
sampled, then we can proceed as in the proof of Theorem 2 and find an exact
match if there is one. If we do not have subgraphs of size $|M|$ sampled, then
we use our sampled subgraphs to build subgraphs $H_{\min}$ and $H_{\max}$ of
size $|M|$ with approximately minimum and maximum numbers of white nodes, then
proceed almost as in the proof of Theorem 2: if the number of white nodes in
$H_{\min}$ is larger but within a factor of $1+\epsilon$ of the number in $M$,
then we return $H_{\min}$; if the number in $H_{\min}$ is more than a factor of
$1+\epsilon$ larger than the number in $M$, then there is no exact match and we
return nothing; if the number of white nodes in $H_{\max}$ is smaller but
within a factor of $1+\epsilon$ of the number in $M$, then we return
$H_{\max}$; if the number in $H_{\max}$ is more than a factor of $1+\epsilon$
smaller than the number in $M$, then there is no exact match and we return
nothing; in all other cases, we proceed as in Theorem 2.

To build $H_{\min}$ we take the next larger subgraph with a minimum number of
white nodes and discard nodes until it has size $|M|$ while leaving it
connected. This next larger subgraph has size less than $2|M|$, because we
sampled for every size that is a power of two; has at most $1+\epsilon$ times
as many white nodes as the subgraph of size $|M|$ with the minimum number of
white nodes, because we sampled whenever the minimum number of white nodes
increased by a factor of $1+\epsilon$; and is a tree, because it is a connected
subgraph of a tree. It follows that discarding nodes takes $O(|M|)$ time and,
since discarding nodes cannot increase the number of white nodes, $H_{\min}$
contains at most $1+\epsilon$ times the minimum number of white nodes. To build
$H_{\max}$ we take the next smaller subgraph with a maximum number of white
nodes and add nodes until it has size $|M|$. By symmetric arguments, this takes
$O(|M|)$ time and, since adding nodes cannot decrease the number of white
nodes, the maximum number of white nodes in a
subgraph of size $|M|$ is at most $1+\epsilon$ times the number in $H_{\max}$.
Finding the path from $H_{\min}$ to $H_{\max}$ takes $O(|M|)$ time using the
representation from Lemma 2.

Theorem 3. When $G$ is a tree containing only two colours, for any positive
constant $\epsilon$ we can build an $O(n)$-space index with which in $O(|M|)$
time we can find an approximate match in which each colour's frequency is
within a factor of $1+\epsilon$ of its frequency in $M$.
A An Index for Strings over Constant-Size Alphabets

Suppose $G$ is a string over a constant-size alphabet and $0 < \epsilon < 1$.
Then in $O(n^{1+\epsilon})$ expected time we can build an index with which,
given a multiset $M$ of characters, we can find all $occ$ matches of $M$ in
$O(|M|^{1/\epsilon} + occ)$ worst-case time. To do this, we store $G$ itself
and, for $1 \le m \le n^{\epsilon}$, we make a pass over $G$ and store, for
each multiset of size $m$ that has a match in $G$, a list of all the locations
of that multiset's matches. Notice the lists for multisets of size $m$ are
disjoint and have total length $n - m + 1$; therefore, with dynamic perfect
hashing we use a total of $O(n^{1+\epsilon})$ expected time and
$O(n^{1+\epsilon})$ space. Given a multiset $M$ with $|M| \le n^{\epsilon}$, we
return our pre-computed list of the locations of $M$'s matches in
$O(|M| + occ)$ time, or $O(occ)$ time if we are given $M$ as a Parikh vector.
Given a multiset $M$ with $|M| > n^{\epsilon}$, we search $G$ in
$O(n) = O(|M|^{1/\epsilon})$ time.
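The construction above can be sketched with a sliding window and a hash table. This is an illustration under assumptions: a built-in Python dictionary stands in for dynamic perfect hashing, the parameter `max_len` plays the role of $n^{\epsilon}$, positions are 0-based, and all function names are hypothetical.

```python
from collections import defaultdict

def parikh_windows(s, pos, m):
    """Yield (start, Parikh vector) for every length-m window of s,
    updating the counts in O(1) time per shift (constant alphabet)."""
    if m < 1 or m > len(s):
        return
    counts = [0] * len(pos)
    for ch in s[:m]:
        counts[pos[ch]] += 1
    yield 0, tuple(counts)
    for start in range(1, len(s) - m + 1):
        counts[pos[s[start - 1]]] -= 1      # character leaving the window
        counts[pos[s[start + m - 1]]] += 1  # character entering the window
        yield start, tuple(counts)

def build_index(s, alphabet, max_len):
    """Store, for each Parikh vector of size at most max_len that occurs in s,
    the list of start positions of its matches."""
    pos = {ch: k for k, ch in enumerate(alphabet)}
    index = defaultdict(list)
    for m in range(1, min(max_len, len(s)) + 1):
        for start, vec in parikh_windows(s, pos, m):
            index[vec].append(start)
    return pos, index

def find_all(s, pos, index, max_len, parikh):
    """Return the start positions of all matches of the given Parikh vector."""
    parikh = tuple(parikh)
    m = sum(parikh)
    if m <= max_len:
        return list(index.get(parikh, []))  # short pattern: table lookup
    # Long pattern: one linear pass over s.
    return [start for start, vec in parikh_windows(s, pos, m) if vec == parikh]
```

Because a Parikh vector determines the pattern length, a single table keyed by the vector suffices for all stored lengths.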

B Proofs of Lemmas 2 and 3

Lemma 2. We can store $G$ in $O(n)$ space such that $q$ rank queries on the
path between any two nodes take a total of $O(\log n + q)$ time.

Proof. We compute the heavy-path decomposition [10] of $G$ and store
$O(1)$-time rank data structures for each of the heavy paths, which takes
$O(n)$ space. The path between any two nodes $u$ and $v$ is a sequence of
$O(\log n)$ intervals of heavy paths. Given $u$ and $v$, for each of these
intervals we compute the number of white nodes in that interval and to either
side of it in the heavy path; this takes a total of $O(\log n)$ time and rank
queries on heavy paths. With this information we can perform any rank query on
the path from $u$ to $v$ using a single rank query on a heavy path. □
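The decomposition used in this proof can be sketched as follows. This is a simplified illustration: it only counts the white nodes on the whole path between two nodes, rather than answering arbitrary rank queries along it, and a prefix-sum array stands in for the rank structures on heavy paths. The class name and the adjacency-list input are assumptions of the sketch.

```python
from itertools import accumulate

class PathWhiteCounter:
    def __init__(self, adj, is_white, root=0):
        n = len(adj)
        self.parent, self.depth = [-1] * n, [0] * n
        order, stack, seen = [], [root], [False] * n
        seen[root] = True
        while stack:                      # iterative DFS: parents before children
            u = stack.pop()
            order.append(u)
            for v in adj[u]:
                if not seen[v]:
                    seen[v] = True
                    self.parent[v], self.depth[v] = u, self.depth[u] + 1
                    stack.append(v)
        size, heavy = [1] * n, [-1] * n
        for u in reversed(order):         # children before parents
            p = self.parent[u]
            if p >= 0:
                size[p] += size[u]
                if heavy[p] < 0 or size[u] > size[heavy[p]]:
                    heavy[p] = u          # child with the largest subtree
        self.head, self.pos = [0] * n, [0] * n
        flat = []                         # colours laid out heavy path by heavy path
        for u in order:
            if self.parent[u] < 0 or heavy[self.parent[u]] != u:
                v = u                     # u is the top node of a heavy path
                while v >= 0:
                    self.head[v], self.pos[v] = u, len(flat)
                    flat.append(1 if is_white[v] else 0)
                    v = heavy[v]
        self.prefix = [0] + list(accumulate(flat))

    def count_white(self, u, v):
        """Number of white nodes on the tree path from u to v (inclusive)."""
        total = 0
        while self.head[u] != self.head[v]:
            if self.depth[self.head[u]] < self.depth[self.head[v]]:
                u, v = v, u
            # Count the interval from the top of u's heavy path down to u.
            total += self.prefix[self.pos[u] + 1] - self.prefix[self.pos[self.head[u]]]
            u = self.parent[self.head[u]]
        lo, hi = sorted((self.pos[u], self.pos[v]))
        return total + self.prefix[hi + 1] - self.prefix[lo]
```

Each iteration of the loop in `count_white` moves across one light edge, so it handles $O(\log n)$ heavy-path intervals per query.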

Lemma 3. We can store $G$ in $O(n)$ space such that, if $G$ contains a
connected subgraph of size $m$ with $w$ white nodes, then we can represent some
such subgraph in $O(w)$ space such that recovering this subgraph takes $O(m)$
time.

Proof. We store the adjacency lists for $G$'s nodes, with each list ordered
such that black neighbours precede white neighbours. With this representation,
we can expand a subgraph by adding only black nodes as long as this is
possible, using $O(1)$ time per added node.

Let $H$ be a connected subgraph of size $m$ with $w$ white nodes. We store
pointers to the white nodes in $H$, which takes $O(w)$ space. Since $G$ is a
tree, we can find the unique paths between these nodes in a total of $O(m)$
time; notice these paths are contained in $H$ and consist of black nodes. If
the subgraph consisting of the white nodes and these paths has fewer than $m$
nodes, then we add black nodes until it has $m$ nodes, which takes a total of
$O(m)$ time. It is possible to add enough black nodes without adding any white
nodes because, e.g., we could add the remaining black nodes in $H$. □
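The recovery procedure in this proof can be sketched as follows. This is a simplified illustration under assumptions: it requires at least one stored white node, it finds the connecting paths with a breadth-first search over the whole tree (so it takes $O(n)$ rather than $O(m)$ time), and it filters neighbours by colour instead of relying on colour-ordered adjacency lists. The function name is hypothetical.

```python
from collections import deque

def recover_subgraph(adj, is_white, whites, m):
    """Return a connected set of m nodes whose white nodes are exactly
    `whites` (a non-empty list), or None if no such set exists."""
    root = whites[0]
    parent = {root: None}
    queue = deque([root])
    while queue:                      # BFS from one white node over the tree
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    # Union of the tree paths from the root to every white node.
    nodes = {root}
    for w in whites:
        u = w
        while u not in nodes:
            nodes.add(u)
            u = parent[u]
    if any(is_white[u] and u not in whites for u in nodes):
        return None                   # a connecting path contains another white node
    # Grow by adding black nodes only, until the set has m nodes.
    frontier = deque(nodes)
    while len(nodes) < m and frontier:
        u = frontier.popleft()
        for v in adj[u]:
            if len(nodes) == m:
                break
            if v not in nodes and not is_white[v]:
                nodes.add(v)
                frontier.append(v)
    return nodes if len(nodes) == m else None
```

The stored representation is just the list `whites`, which takes $O(w)$ space; everything else is rebuilt from the tree at query time.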



